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Abstract 

We present an acceleration of the well-established Krylov-Ritz methods to compute the sign function of large complex 
matrices, as needed in lattice QCD simulations involving the overlap Dirac operator at both zero and nonzero baryon 
density. Krylov-Ritz methods approximate the sign function using a projection on a Krylov subspace. To achieve a 
high accuracy this subspace must be taken quite large, which makes the method too costly. The new idea is to make a 
further projection on an even smaller, nested Krylov subspace. If additionally an intermediate preconditioning step is 
applied, this projection can be performed without affecting the accuracy of the approximation, and a substantial gain 
in efficiency is achieved for both Hermitian and non-Hermitian matrices. The numerical efficiency of the method is 
demonstrated on lattice configurations of sizes ranging from 4 4 to 10 4 , and the new results are compared with those 
obtained with rational approximation methods. 



1. Introduction 

In quantum chromodynamics (QCD) some physical observables rely on the chiral properties of the theory. To 
study such observables in a lattice formulation of QCD it is important to discretize the Dirac operator such that it 
respects the corresponding chiral symmetry. This is most faithfully achieved using the overlap Dirac operator |Q]|2l- 
To study QCD at nonzero baryon density the overlap formulation was recently extended to include a quark chemical 
potential (3] 0. A major ingredient in the overlap operator, which makes its use very challenging, is the computation 
of the sign function of a complex matrix, which is Hermitian at zero baryon density, but becomes non-Hermitian when 
a quark chemical potential is introduced. 

The search for efficient numerical methods to compute the sign function for the large sparse matrices encountered 
in this context is an ongoing field of research. Typically, Krylov subspace methods are employed to evaluate the op- 
eration of a matrix function on an arbitrary vector. We distinguish two main variants: the Krylov-Ritz approximation, 
which evaluates the function via a projection on the Krylov subspace, and the rational approximation, where the func- 
tion is first approximated by a partial fraction expansion, which is then efficiently solved using a multi-shift Krylov 
subspace inverter. 

In the Hermitian case efficient rational approximation methods for the sign function have been devised [5 6 | and 
are currently being used in large scale lattice simulations. The current method of choice uses the Zolotarev partial 
fraction expansion (6]|7][8], which yields the optimal rational approximation to the sign function over a real interval 
191 , in conjunction with a multi-shift conjugate gradient inversion. For non-Hermitian matrices, which occur in the 
presence of a quark chemical potential, Krylov subspace approximations to the sign function are relatively new and 
still under development. Recently, partial fraction expansion methods using the Neuberger expansion [5| with non- 
Hermitian multi-shift inverters were proposed [ 10 1. 

The Krylov-Ritz approximation, which we discuss in this paper, is based on the construction of a Krylov basis 
and its accompanying Ritz matrix. Depending on the algorithm used to construct the basis we distinguish between 
the Lanczos approximation in the Hermitian case J6], and the Arnoldi approximation IfTTII or two-sided Lanczos 
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approximation [ 12] in the non-Hermitian case. The latter clearly yields the more efficient function approximation for 
non-Hermitian matrices fT21 . In the Krylov-Ritz approximation the large complex matrix is projected on the Krylov 
subspace, and its sign function is approximated by lifting the sign function of its projected image (Ritz matrix) back 
to the original space. The latter sign function is computed to high accuracy using the spectral definition of a matrix 
function or using a matrix-iterative method. When a large Krylov subspace is needed to reach the desired accuracy, 
the computation of this matrix sign function becomes a bottleneck for the algorithm. 

Herein we will introduce an enhancement of the Krylov-Ritz approximation method which substantially reduces 
the cost of this internal sign computation and boosts the efficiency of the overall method, such that it competes with, 
and even surpasses, the rational function approximation in both the Hermitian and non-Hermitian case. The dramatic 
reduction in computation time is achieved by projecting the Ritz matrix on an even smaller, nested Krylov subspace, 
after performing a suitable preconditioning step first. The desired sign function is then computed via the sign function 
of the inner Ritz matrix, which yields the same accuracy as the original Krylov-Ritz approximation. 

The outline of the paper is as follows. In Sec. [2] we introduce the overlap operator and the matrix sign function. In 
Sec.[3]we show how the matrix function of large matrices is computed using Krylov-Ritz approximation methods. In 
Sec.B] we introduce the nested Krylov subspace method, which substantially enhances the efficiency of the Krylov- 
Ritz approximation to the sign function. We study its convergence properties and present numerical results for various 
lattice sizes, including a comparison with rational approximation methods. Finally, our conclusions are given in 
Sec. [5] For completeness we have added some algorithms in Appendix. 

2. Overlap operator and the matrix sign function 

Our motivation to develop numerical algorithms to compute the matrix sign function of large, sparse, complex 
matrices comes from its application in lattice quantum chromodynamics (LQCD). The overlap formulation of the 
Dirac operator [1,2], which ensures that chiral symmetry is preserved in LQCD, is given in terms of the matrix sign 
function lfl3l . and its definition in the presence of a quark chemical potential /j. [3 1 is given by 

D ov 0u) = 1+75 sgn( r5 D w (^)), (1) 

where 1 denotes the identity matrix, 75 = 71727374 with 71 , . . . , 74 the Dirac gamma matrices in Euclidean space, sgn 
is the matrix sign function, and 
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D w Qx) = 1 - kJ](T; + 77) - Kie^l + e-n A ) (2) 

!=1 

is the Wilson Dirac operator at nonzero chemical potential [ 14 1 with (T„ ) yx = (1 + y v )U x , ±v 6 yrX± i,, « - 1/(8 + 2m w ), 
m w 6 (-2, 0) and U x ,± v € SU(3), where U x - V = U x -t y+v . The exponential factors e ±fI implement the quark chemical 
potential on the lattice. For fi = the argument of the sign function is Hermitian, while for /i + it is non- 
Hermitian. To compute the overlap operator we need to define the matrix sign function for a general complex matrix 
A of dimension n. A generic matrix function /(A) can be defined by 

f(A) = ^-.^mizi-AyUz, (3) 

where T is a collection of contours in C such that / is analytic inside and on F and such that T encloses the spectrum of 
A. If A is diagonalizable, i.e., A = UAU~ l , with diagonal eigenvalue matrix A = diag(/li, . . . , A n ) and U e Gl(n, C), 
then this general definition can be simplified to the well-known spectral form 

/(A) = Uf{K)U~\ (4) 

with 

/(A) = diag (f(Ai), f(A„)) . (5) 



2 



If A cannot be diagonalized, a spectral definition of /(A) can still be derived using the Jordan decomposition IPT31 . 
For simplicity, but without loss of generality, we assume diagonalizability in the following. For Hermitian A the 
eigenvalues are real and their sign is defined by sgn(x) = ±1 for x ^ with x € K, such that Eq. Q readily defines the 
matrix sign function. For non-Hermitian A the eigenvalues are complex and require a definition of sgn(z) for zeC. 
The sign function needs to satisfy (sgn(z)) 2 = 1 and reproduce the usual sgn(x) for real x. We define 

z 

sgn(z) = — = sgn (Re(z)) , (6) 
Vz 2 

where the cut of the square root is chosen along the negative real axis. This choice, although not unique, gives the 
correct physical result for the overlap Dirac operator in Eq. ([TJ (see Ref. [4]). 



3. Krylov-Ritz approximations for matrix functions 

Since we aim at problems with large matrices, as is the case in LQCD, memory and computing power limitations 
require sophisticated methods to deal with the sign function. For a matrix A of large dimension n the common 
approach is not to compute f(A) but rather its action on a vector, i.e., y = f(A)x, which is needed by iterative inverters 
to compute f(A)~ l b or by iterative eigenvalues solvers for f(A). The Krylov-Ritz method approximates the resulting 
vector in the Krylov subspace 

K k (A,x) = spm(x,Ax,A 2 x, . . . ,A k ~ x x) (7) 

of C", implicitly making a polynomial approximation of degree k - 1 to f(A). The optimal approximation to y in this 
subspace is its orthogonal projection yf , For V]t = (vi, . . . , where the v, form an orthonormal basis of 'KkiA, x), an 
orthogonal projector is given by P — VkV k , and we have 

y = f(A)x » yj; = Pf(A)x. (8) 

However, to compute this projection on the Krylov subspace we already need y, which is the quantity we wanted to 
determine in the first place. Thus, we need to replace this exact projection by an approximation. To reduce the large 
dimensionality of the problem one typically projects A on the Krylov subspace using A* = PAP. The projected matrix 
A* has dimension n but rank at most k. The ^-dimensional image of the projected matrix A* is defined by the matrix 
Hk = VlAVk, which is often referred to as Ritz matrix. The components of Hk are the projection coefficients of Ak in 
the basis Vfc, as Ak and Hk are related by Ak = VkH^Vl (in analogy to the vector case). 

The Krylov-Ritz approximation |[T6l[T7l to f(A) consists in taking the function of the Ritz matrix Hk and lifting it 
back to the full n-dimensional space, 

f(A) » Vkf(Hk)Vl (9) 

This approximation actually replaces the polynomial interpolating / at the eigenvalues of A by the polynomial inter- 
polating / at the eigenvalues of Hk, also called Ritz values 07). Substituting the approximation (|9]) in f(A)x yields 

y * V k f(HkWlx = \x\V k f(Hk)ef\ (10) 

where we choose vi collinear with x, i.e., v\ = Vkef = x/\x\, with e\ the first unit vector of C*. To evaluate the 
approximation ( fTO) we do not need to perform the matrix multiplications of Eq. |9]l explicitly. First, one computes the 
function f(Hk) of the ^-dimensional Ritz matrix to high accuracy, using the spectral definition Q or a matrix-iterative 
method. Then, the final approximation is simply a linear combination of the basis vectors v*, with coefficients given 
by the first column of /(///.) multiplied with \x\. 

The Krylov-Ritz approximation described above uses an orthonormal basis of KkiA, x). For the Hermitian case 



such a basis can efficiently be constructed using the Lanczos algorithm, which we listed in Appendix A for com- 
pleteness. It generates an orthonormal basis and a tridiagonal symmetric Hk using a three-term recurrence relation. 
The non-Hermitian case is more laborious as the construction of an orthonormal basis is typically performed using 
the Arnoldi algorithm, which suffers from long recurrences as each basis vector has to be orthogonalized with respect 
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to all the previous ones. The two-sided Lanczos algorithm is a suitable alternative fTH which uses two three-term 
recurrences to construct bases V k = (vi, . . . , v k ) and W k = (yv\ , . . . w k ) of the right, r espectively left, Krylov subspaces 



"Kk(A, x) and "Kk(A^ , x), which are biorthonormal, i.e., v\ Wj = Sy (see Appendix B for a listing of the algorithm). The 
lack of orthogonality of the basis V k prevents the construction of the orthogonal projector needed for the Krylov-Ritz 
function approximation d9). Nevertheless, the biorthonormality between V k and W k can be used to construct an oblique 
projector P — V k Wl on the right Krylov subspace. The oblique projection of A is A k = PAP and its A:-dimensional 
image is defined by Hk - WtAV k , which we call two-sided Ritz matrix, such that A k = VkH k wl. The matrix H k 
generated by the two-sided Lanczos algorithm is tridiagonal. The two-sided Krylov-Ritz approximation to f(A) then 
consists in taking the matrix function of H k and lifting it back to the original space, 

f(A) * V k f(H k )Wl (11) 

After applying this approximation of /(A) to x we find an expression which is similar to Eq. (ffOji, 



y * V k f(H k )Wlx = \x\V k f(H k )ef\ (12) 

where the last step assumes that v\ = V k e® = x/\x\. The price paid to achieve short recurrences in the non-Hermitian 
case is the loss of orthogonality of the projection on the Krylov subspace, which translates in a somewhat lower 
accuracy of the two-sided Lanczos approximation compared to the Arnoldi approximation, for equal Krylov subspace 
sizes. Nevertheless, the large gain in speed makes it by far the more efficient method lfT2l . 

In the case where / is the sign function, the approximations ( p~0] > and (12\ require the computation of sgn(H k ). 
Although it could be computed directly with the spectral definition Q, matrix-iterative methods are often cheaper for 
medium sized matrices. We choose to employ the Roberts-Higham iteration (RHi) fl8l : Set Sq - H k and compute 

S n+l = hlS n +S- 1 ). (13) 

This iteration converges quadratically to sgn(Hk), if the sign function for complex arguments is defined by Eq. ([SJ. 
The matrix inversion scales like k 3 and so will the RHi. For the QCD application considered here, typically 7 to 10 
iterations are necessary to converge within machine precision lfTTlfT2l . 

The hope is that the Krylov-Ritz approximations (jTOf and ( fL?) are accurate for k <K n. The method is known 
to work very well as long as no eigenvalues are close to a function discontinuity. However, for the sign function 
this method suffers from the sign discontinuity along the imaginary axis. If A has eigenvalues close to this discon- 
tinuity the approximating polynomial must steeply change from -1 to +1 over a small interval to give an accurate 
approximation. This cannot be achieved with a low order polynomial, i.e., the Krylov subspace must be large, which 
makes the algorithm expensive. The common solution to this problem is to use deflation, where the contribution of 
the eigencomponents associated to these critical eigenvalues to the sign function is computed exactly^ The Krylov 
subspace approximation is then performed in a deflated space, i.e., the subspace where the directions along the critical 
eigenvectors have been removed. We refer to the literature for details IfTTII . 

The convergence of the Krylov-Ritz approximations to the matrix sign function is illustrated in Fig.[T| the Lanczos 
approximation for the Hermitian case on the left, and the two-sided Lanczos approximation for the non-Hermitian case 
on the right. The accuracy of the approximation cannot be determined by comparing to the exact value sgn(A)jt, as its 
evaluation by direct methods is too costly if A is large. To obtain an estimate for the error, we compute x w sgn(A) 2 x 
(by applying the Krylov-Ritz approximation twice in succession), which should equal x if the approximation to the 
sign function were exact, and then take e = \x — x\l2\x\ as a measure for the error. This error estimate proved to be 
consistent with the true error obtained by comparing the approximation to the exact solution for 4 4 and 6 4 lattices, and 
will therefore be used for all lattice sizes. Here, and in all subsequent tests, we choose the test vector x = (1, . . . , 1). As 
expected, the accuracy improves with increasing Krylov subspace size k, and a larger deflation gap A, corresponding 
to a higher number of deflated eigenvectors, leads to a faster convergence. For a given accuracy and equal deflation 
gap, the subspace size k required for non-Hermitian A is larger than for Hermitian A. 



'in practice we deflated the eigenvalues with smallest modulus \A\ instead of those with smallest absolute real part |Re/l|, as the former are 
more efficiently determined numerically, and both choices yield almost identical deflations for the operator -ysD w (/i) of Eq. {TJ. The reason for this 
is that, as long as the chemical potential /.t is not unusually large, the spectrum looks like a very narrow bow-tie shaped strip along the real axis, 
and the sets of eigenvalues with smallest absolute real parts and smallest magnitudes will nearly coincide. In the following we therefore define the 
deflation gap A as the largest deflated eigenvalue in magnitude, i.e., A = max |/!d e fl|. 
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Figure 1 : Accuracy of the Krylov subspace approximation for y = sgn(A)x, where A is y$D w (p) for a 6 4 lattice (for a lattice volume V the matrix 
ysZ) w has dimension 12V, such that dim(A) = 15552 here). Left pane: Hermitian case (ji = 0) using the Lanczos method, right pane: non-Hermitian 
case with chemical potential /i = 0.3 using the two-sided Lanczos method. The relative error e is shown as a function of the Krylov subspace size 
k for different deflation gaps A (given in parenthesis). 

To analyze the efficiency of the algorithm we briefly sketch the three major contributions to the total CPU time. For 
each matrix A the deflation requires the computation of the critical eigenvalues and the corresponding eigenvectors. 
The time needed by the rest of the algorithm strongly depends on the eigenvalue gap, as the Krylov subspace size can 
be reduced if the deflation gap is increased. As mentioned at the beginning of this section, the product f(A)x is usually 
needed for many source vectors x, e.g., as part of an iterative inversion. In this case the expensive deflation of A only 
needs to be performed once in an initialization step, while the Krylov subspace part of the algorithm will be repeated 
for each new vector x. For this reason we assume from now on that an initial deflation has been performed and we 
will concentrate on the efficiency of the Krylov subspace part of the algorithm. We discern two main components 
in the Krylov-Ritz method: the construction of the Krylov basis using the Lanczos or two-sided Lanczos algorithms, 
where the computation time grows linearly with the subspace size k, and the RHi to compute sgnf^), which scales 
as £ 3 . Figure [2] illustrates these last two contributions. For high accuracy the Krylov subspace becomes large such 
that the cost of the RHi dominates the total CPU time of the Krylov-Ritz approximation and the method becomes 
too costly. In the following, the implementation of the Krylov-Ritz approximation for which sgn(//t) is computed 
using Eq. (fT3]l will be referred to as non-nested method. In the next section we will present a nested Krylov subspace 
method, which drastically reduces the cost to compute sga(Ht)e\ and vastly improves the overall efficiency of the 
Krylov-Ritz approximation. 

4. Nested Krylov subspace method for the sign function 

4.1. Nesting and preconditioning 

We introduce a new method which speeds up the expensive computation of the vector sgaiH^ef^ required in the 
Krylov-Ritz approximations ( p~0] > and (jT2j to sgn(A)x. The idea is to approximate this matrix-vector product by a 
further Krylov-Ritz approximation, using a second, nested Krylov subspace (specified below) of size t <k k, i.e., 

sgn(H k )e[ k) = V f sgn(H c )e { {\ (14) 

where V> is the matrix containing the basis vectors of the inner Krylov subspace, constructed with the Lanczos or 
two-sided Lanczos method, and He is the inner Ritz or two-sided Ritz matrix. The sgn(H( ) is computed using the RHi 
on the inner Ritz matrix H( . After substituting this result in Eq. ( fT0] i and ( [T2] i we get the nested approximation 

y*\x\V k V e sgn(H e )e\ f) (15) 
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Figure 2: CPU time t (in seconds) versus accuracy for an 8 4 lattice configuration in the Hermitian case with deflation gap A = 0.055 (left) and the 
non-Hermitian case with /j = 0.3 and deflation gap A = 0.107 (right). The full line shows the total time required to compute sgn(A)x, while the 
dashed line gives the time needed to construct the Krylov basis. The difference between both lines represents the time taken by the RHi to compute 
sgn(//^). The irregular convergence pattern for the non-Hermitian case is a well-known feature of the two-sided Lanczos algorithm. 



to sgn(A)jt. By introducing an additional Krylov subspace, the number of operations necessary to compute ^ga{H^)e\ 
is reduced from 0(k 3 ) in the non-nested method to 0(fi) + O(kt). If i <k k this will very much improve the efficiency 
of the Krylov-Ritz approximation. 

The obvious choice for the inner Krylov subspace is < K((Hk, ef*). However, it is easy to see that approximations 
in this Krylov subspace will not improve the efficiency of the method. The Ritz matrix H( of the Krylov subspace 
r Ke(Hi c , e^) will only contain information coming from the I x t upper left comer of Hk, because of the tridiagonal 
nature of Hk and the sparseness of the source vector ef\ This will effectively cut down the size of the outer Krylov 
subspace from k to I, which will substantially worsen the accuracy of the approximation if t is chosen much smaller 
than k. Nonetheless, the nested Krylov subspace method can be made to work efficiently if we perform an initial 
preconditioning step on the tridiagonal Ritz matrix, replacingF] 

H k ^H' k = \[pH k + {pH k y l ], (16) 

with p a positive real number, and construct the approximation to sgn(fljt)*i m the Krylov subspace r K({H' k ,ef ) ). This 
alternate Krylov subspace can be used to compute sgn(i?fc)eP because the transformation leaves the sign unchanged. 
To show this, we note that both matrices have identical eigenvectors, as a matrix and its inverse share the same 
eigenvectors, and that the sign of their eigenvalues satisfies 



sgn \ [pz + ^ j = s g n Re (pz + ^ j = s g n Re (pz + j = sgn ^1 + j-^j Re(pz) 



= sgn(z), (17) 



where we used the definition |6]). Hence, sgn(H' k ) = sgn(Hk) according to Eq. Q 

As Hk is tridiagonal the cost of its inversion, required in ( [To} , is only of O(k). Moreover, as the transformation 
increases the relative gap between the spectrum and the singularity along the imaginary axis (see below), we expect a 
clear gain in efficiency for the inner Krylov-Ritz approximation, characterized by ( <sc k. 

For a Hermitian matrix the transformation induced by the preconditioning step is illustrated in Fig. [3] for real 
positive eigenvalues (for negative values the graph would be reflected with respect to the origin). The factor p is chosen 
to optimize the effect of the transformation on the relative distance to the imaginary axis, which in the Hermitian case 



2 The factor 1 /2 is chosen for convenience. For p = 1 the transformation actually mimics the first step of the RHi j!3fr . 

3 If Hk is not diagonalizable, the equality can be shown by applying Eq. to the integration variable in the integral representation |5J. 
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Figure 3: Mapping of the preconditioning step z' = (pz, + l/pz)/2 for positive real eigenvalues and various values of p. 



corresponds to a minimization of the condition number. We examine the condition number for the Hermitian case, 
assuming that the spectral support of is similar to that of the original matrix A, after deflation. As can be seen 
from Fig. [3] after transformation the smallest eigenvalue (in absolute value) is z' min = 1, while the largest will be given 
by the transform of either the smallest or largest eigenvalues of Hk- The smallest condition number will be achieved 
when both values are identical, i.e., for p satisfying 



U 1 

- [Pimm + 



1 



where z m ; n = min \z\ and z max = max \z\, for z in the spectrum of H^, and the largest transformed eigenvalue will be 



Popt 



1 



(18) 



1 



In the Hermitian case, the transformation (jT6]» therefore reduces the condition number C by a factor 



C 

c 7 



(19) 



(20) 



The effect of the preconditioning of the Ritz matrix for a typical spectrum of ys£> w in lattice QCD is illustrated in 
Fig.|4]for the Hermitian case. The top and bottom graphs depict the spectra of and H' k , respectively. The spectrum 
of the original Ritz matrix has only a small gap at zero, while the gap for the transformed matrix is large. In this 
example, the condition number is almost improved by a factor 20. In general, the value of z max for y^D^ varies only 
slightly with the choice of the simulation parameters and T will mainly depend on the deflation gap. 

For the non-Hermitian case, let us assume that the complex spectrum is contained in the circles C(-m, r) U C(m, r), 
with real center m > and radius r < m. The optimal p, maximizing the relative distance from the imaginary axis for 
the transformed spectrum, is still given by Eq. ( p~8] > which now simplifies to p opl = (m 2 - r 2 ) -1 ^ 2 . For this choice the 
transformed eigenvalues are contained in the circles C(-m', r') U C(m' , r') with center m' = (m s + l/m s )/2 and radius 
r' = (m s - 1 /m s )/2, where m s = p piin. This is illustrated in the left panel of Fig. 5] where we show the transformation 
of a circle C{m, r) for the optimal and a sub-optimal value of p. For sub-optimal p the transformation yields an inner 
and an outer circle-like contour, which merge into the circle C(m', r') when p — > p opt . For p opt the relative distance 
from the imaginary axis will be maximal and we expect the transformation ( fl6l > to work best. The gain in efficiency 
will however not be as large as for the Hermitian case. This can be quantified by the relative distance to the imaginary 



4 In practice p opt is only known approximately, as it is computed from spectral information of A instead of . However, this has no significant 
impact on the performance of the nested method. 
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Figure 4: Upper pane: density of eigenvalues of Ht in the Hermitian case for an 8 lattice with k = 1536. The spectrum has a narrow deflation 
gap A = 0.055. The optimal p-factor (18) for the transformation \16\ is p opt ss 1.86 (using z m m = A and z max = 5.26). The lower pane shows the 
corresponding eigenvalue density of the transformed matrix H' k , where the condition number is improved by a factor T = 19.3 (see Eq. (20) ). 
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Figure 5: Mapping of the transformation zf = (pz + l/pz)/2 for 
complex z — x + f'y on a circle C(m, r) with center m = 2.55 
and radius r = 2.45 (left) and for z on the ellipse E (m, r, 0. 1 r) 
(right). In both plots the black curve shows the original j-values, 
the red curve the transformed values z' with optimal p, and the 
blue dashed lines the transformed values with a sub-optimal p. 



axis, in analogy to the calculation performed above for the Hermitian case. For the original spectrum we define the 
relative distance as 



d = 



min | Re z\ m — r 
max | Re z\ m + r 



and for the transformed spectrum 



d' = 



mini Rez'l 



1 



2 2 

m - r 



max | Re z' | m' + r' mj m 2 
The improvement factor due to the transformation is given by the ratio of these distances, yielding 

x 2 



d \ m I \ mj 



(21) 



(22) 



(23) 



where we wrote r = m— A, with A the deflation gap. When A « ra we will have T « 4. For the example shown in 
the left plot of Fig.|5]the transformation generates an improvement by a factor T = 3.84, as computed with Eq. ( |23| ). 



Figure 6: Red dots: spectrum of ffj in the non-Hermitian case for an 8 4 lattice with k = 1580, fi = 0.3 and deflation gap A = 0.107. The optimal 
p-factor j I S| for the transformation 116} is p opt x 1.335 (using Zmm = A and z max = 5.243). Blue circles: the corresponding spectrum of the 
transformed matrix H' k . As desired, the transformed eigenvalues are well away from the imaginary axis (the vertical lines at x = ±1 serve to guide 
the eye). Note the different scales on the x and y axes. 



In lattice QCD at nonzero baryon density jsD w is usually weakly non-Hermitian and, after deflation, the spectra 
are contained in ellipses E(-m, a, b) U E(+m, a, b), with center m e M. + and major and minor axes a and b along the 
real and imaginary axes, respectively. The transformation ( fl"6| > of an ellipse E(m, a, b) with aspect ratio a/b = 10 is 
illustrated in the right panel of Fig. [5] For such a narrow ellipse the transformed spectrum is qualitatively similar to 
the Hermitian case, as all the eigenvalues are transformed to the right of z' - 1, i.e., away from the imaginary axis, 
such that the high efficiency of the transformation is still guaranteed. The optimal value p opt is again determined by 
( fT"8j ) with p p t = (m 2 - a 2 )~ l l 2 , as it maximizes the relative distance from the imaginary axis. The transformation is 
illustrated for a realistic test case of lattice QCD in Fig. [6] where the eigenvalues and transformed eigenvalues of the 
Ritz matrix for y^D^in) are shown for fi = 0.3. 

As we will see below the preconditioning step significantly speeds up the Rrylov-Ritz approximation in its appli- 
cation to lattice QCD at zero and nonzero chemical potential. 



4.2. Convergence 

In this section we investigate the convergence properties of the nested method. The method was implemented 
to compute the sign function of y^D^Qj) needed by the overlap operator ([TJ, for both the Hermitian and the non- 
Hermitian case. Whenever the matrix has eigenvalues close to the imaginary axis, these critical eigenvalues are first 
deflated to open up a deflation gap, necessary to keep the Krylov subspace within a reasonable size (see Sec. (|3]l). Our 
implementation uses Chroma [19| to compute the Wilson operator. The linear algebra is performed with BLAS and 
LAPACK routines. To ensure the efficiency of the nested method a judicious implementation of the preconditioning 
step ([16}, used to construct the inner Krylov subspace, is needed. Explicitly inverting the tridiagonal matrix Hk to 
form the full matrix H',, then constructing the basis of the inner Krylov subspace by successive full matrix-vector 
multiplications would make a rather inefficient algorithm. To construct the inner Krylov subspace we do not need to 
construct the full matrix H' k explicitly, but only have to apply H' k to I — 1 vectors of (in the non-Hermitian case 
H'J is also needed). These products are best computed using the LU decomposition of Hk, which is 0{k) and thus 



especially efficient for tridiagonal matrices. A detailed listing of the algorithm is given in Appendix C 



The overall accuracy of the nested approximation ( fT5] > depends on the parameters k and I, defining the sizes of 
the outer and inner Krylov subspaces, respectively. For I — > k the solution of the nested method will converge to that 
of the non-nested method with Krylov subspace size k and accuracy s^, so its total error will also converge to s k . To 
investigate the accuracy of the nested algorithm, our strategy is to fix the outer Krylov subspace size k, corresponding 
to a certain desired accuracy, and vary the inner Krylov subspace size t. We show the convergence results for an 8 4 
lattice configuration in Fig. [7] for both the Hermitian and non-Hermitian case. As expected the nested method reaches 
the accuracy of the non-nested method when its size is large enough. Surprisingly however, this happens for I <K k, 
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Figure 7: Accuracy e of the nested method for an 8 4 lattice configuration. Hermitian case with deflation gap A = 0.055 (left) and non-Hermitian 
case with /i = 0.3 and deflation gap A = 0. 107 (right). s(k) shows how the error of the non-nested method decreases with growing Krylov subspace 
(blue line). The vertical line fixes the size k of the outer Krylov space used in the nested method. s(£) shows the accuracy of the nested method, for 
fixed k, as a function of the size I of the inner Krylov subspace (red line). The rapid convergence illustrates the efficiency of the nested method. The 
smallest value of t for which optimal convergence is reached is denoted by f opt . Note that we always restrict ourselves to even Krylov subspace 
sizes, as odd values systematically give a somewhat worse accuracy because of spurious near- zero eigenvalues occurring in the Ritz matrix. 




Figure 8: Convergence of the nested method for an 8 4 lattice configuration as a function of the relative inner Krylov subspace size C/k, for various 
deflation gaps (given in parenthesis). For each gap the value of k is chosen such that an accuracy of 10~ 8 is achieved. Left: Hermitian case with 
k = 2806, 1462 and 758 for deflation gap A = 0.025, 0.05 and 0.1. Right: non-Hermitian case with p = 0.3 and k = 3808, 1456 and 634 for 
A = 0.05, 0. 1 and 0.2. Again, the irregular convergence pattern for the non-Hermitian case is characteristic for the two-sided Lanczos algorithm. 



as the convergence of the inner Ritz approximation seems to be extremely fast. The smallest value of i for which 
optimal convergence is reached will be called { opt . The fast convergence is closely related to the large improvement 
in condition number discussed in the previous section. We also showed in Eq. ( [20] > how the improvement of the 
condition number, due to the preconditioning of the Ritz matrix Hk, depends on the deflation gap. A smaller gap will 
yield a larger improvement, and vice-versa. This in turn will influence the convergence rate of the nested method. 
Figure [8] verifies that the result £ opt <K k remains valid for different deflation gaps. The figure also illustrates that the 
somewhat larger reduction in condition number achieved for a smaller gap yields an accordingly smaller ratio £ opt /k 
(approximately proportional to the ratio of the respective improvement factors T). This is an additional advantage 
as the size reduction is largest when the outer subspace is large. In all cases, the inner Krylov subspace can be taken 
much smaller than the outer subspace, such that the efficiency of the Krylov-Ritz method is substantially boosted, as 
will be shown in the benchmarks below. 
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Figure 9: Error and CPU usage of the nested method for lattice size 8 4 . Hermitian case with deflation gap A = 0.055 and fixed k = 1536 (left), 
and non-Hermitian case with /j = 0.3, A = 0.107 and k = 1580 (right). e(() shows the accuracy versus inner Krylov subspace size t (red line). t(t) 
shows the total CPU time in seconds (solid blue line), while the horizontal dashed line measures the time needed to construct the basis in the outer 
Krylov subspace. The difference between both lines corresponds to the time taken by the RHi to compute sgn(//f). The vertical band highlights 
the operational window of the nested method, i.e., the region in I where the accuracy is optimal, but the CPU-time used to compute sgn(//() is 
negligible. 



We also verified that the convergence curves are fairly insensitive to the choice of the source vector and lattice 
configuration. The fast convergence property of the nested method is generic, regardless of the simulation details, 
for both the Hermitian and non-Hermitian case, even though the precise value of £ opt depends on the lattice size, the 
simulation parameters, the deflation gap and the desired overall accuracy (determined by k). 

4.3. Benchmarks 

With the fast convergence (£ opt <K k) discussed in the previous section, we can expect a substantial gain in com- 
putation time when using the nested method. The total CPU time consumed by the nested method is illustrated in 
Fig.|9]for the Hermitian case (left) and the non-Hermitian case (right). The size of the outer Krylov subspace is kept 
fixed, such that its construction gives a constant contribution to the run time, depicted by the horizontal dashed line. 
The contribution to the CPU time which varies with i mainly comes from the computation of sgn(H c ) with the RHi 
and is proportional to £ 3 . For I k, k the total run time of the nested method is about equal to that of the non-nested 



method. However, as illustrated by the e{€) curve (red line) and discussed in Sec. 4.2 I can be chosen much smaller 
while preserving the accuracy of the non-nested method. The central result, illustrated by the vertical band in Fig. [9| 
is that there exists an interval in i for which the accuracy is still optimal, but the CPU time needed to compute sgn(H() 
with the RHi is negligible compared to the time required to construct the outer Krylov subspace. There is therefore 
no need to make a compromise between run time and accuracy, as both can be optimized simultaneously. The error 
in this range is the minimal error achievable with the given size of the outer Krylov subspace, while the run time is 
completely dominated by the cost for building the basis in that subspace. The nested method is able to quench the 
CPU time needed for the computation of sgn(i/ J t)e ( * ) without affecting the accuracy of the Krylov-Ritz approximation. 

To evaluate the nested method further, we compare it to state-of-the-art rational approximation methods. In the 
Hermitian case the Zolotarev rational approximation, evaluated with a multi-shift conjugate gradient inverter J6), is 
routinely used in lattice simulations. In the non-Hermitian case, i.e., simulations at nonzero baryon density, overlap 
fermions are not yet commonly used because of their high cost, but recently an efficient algorithm was presented, 



which evaluates the Neuberger rational approximation using a multi-shift restarted FOM inverter [ 10 1 . In Fig. 10 we 
compare the results obtained with the nested Krylov subspace and rational approximation methods, and show how 
the CPU time varies as a function of the achieved accuracy for various lattice sizes. In all cases the Hermitian and 
non-Hermitian versions of the nested method perform better than the rational approximation method. The volume 
dependence of the run time for a fixed accuracy s can be extracted from Fig. 10 and is displayed for s = 10~ 8 



in Fig. 11 Fits to the nested method results show a volume dependence which is slightly steeper than linear, i.e., 
proportional to V 1 2 for the Hermitian case and V 1 3 for the non-Hermitian case. The comparisons clearly demonstrate 
the good efficiency of the nested method. 
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Figure 10: Comparison of the nested Krylov subspace method (filled circles) with rational approximation methods (full lines) for lattices of sizes 
4 4 , 6 4 , 8 4 and 10 4 for two different deflation gaps (given in parenthesis). Left: Hermitian case comparing the nested Lanczos approximation (Lane) 
with the Zolotarev approximation (Zolo), evaluated using the Chroma QCD library. Right: non-Hermitian case with /i = 0.3 comparing the nested 
two-sided Lanczos method (2sL) with the Neuberger approximation evaluated with a restarted FOM algorithm (rFOM). The timings were measured 
on a single 2.4 GHz Intel Core 2 core with 8 GB of memory. 
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4.4. Note on the memory usage 

In the numerical tests we observed that, for a fixed deflation gap, the Krylov subspace size needed to achieve a 
certain accuracy is almost independent of the lattice volume in the Lanczos approximation and only grows slowly 
with the volume in the two-sided Lanczos approximation. Therefore, the memory consumed by the Krylov basis 
is roughly proportional to the lattice volume. For large lattice sizes this storage requirement might become too large 
to run the Krylov-Ritz approximation on a single node. 

One solution, which only requires little storage, is to implement a double-pass version of the algorithm, which is 
possible due to the use of short recurrences. In double-pass mode only the two most recently generated basis vectors 
are stored during the construction of the outer Krylov subspace basis. In the first pass the matrix H k is built and the 
product sgn(//t)ej i) is computed with Eq. ( |14) , In the second pass the basis vectors of the outer Krylov subspace are 
generated again and immediately added in a linear combination, whose coefficients were computed in the first pass. 
The drawback of this variant is that the Krylov basis is constructed twice, such that the corresponding CPU time will 
be doubled. 

The more efficient solution is to parallelize the single-pass version of the algorithm, such that the memory require- 
ment gets distributed over the available nodes. Benchmarks on larger volumes, using such a parallel implementation, 
are currently being performed. 



4.5. Multi-level nesting 

In principle, if the inner Krylov subspace in Eq. ( fT5] l is still too large for an efficient application of the RHi on the 
inner Ritz matrix, the nested method could be applied recursively]^] In this case we rename k to Icq,{ to k\ , and add more 
recursively nested levels kj as necessary. Except for the deepest level, the matrix-vector product sgn(// J t l )e < 1 /: ' , required 
at level i will be computed with a Krylov-Ritz approximation ( fT~4-| > in the nested Krylov subspace < Kk M (H' k , ef'- 1 ), where 
H' k is defined by Eq. ( |T6] > on and typically fej+j <K kj. At the deepest level the sign function of the Ritz matrix will 
be evaluated with the RHi. This multi-level nesting is illustrated in Fig. 12 where we show the convergence curves 
for 1, 2, 3, 4 and 5 nested levels as a function of the size of the innermost Krylov subspace, with the sizes of all outer 
levels kept fixed to some value inside their convergence region (the convergence curves do not depend on the precise 
choice of the outer kfs). As before, the convergence criterion is set by the size ko of the outer Krylov subspace. Each 
additional level lowers the size of the Krylov subspace. In the case depicted in Fig. 12 the optimal Krylov subspace 
sizes, i.e. where convergence is reached, for the successive levels decreases from 1536 — > 90 — » 20 — » 8 — > 4 — > 2. 
The improvement is most dramatic for the first nested level, but fast convergence is exhibited at all levels^] This can 



5 Note that for all cases considered in the current study a single level of nesting was sufficient. 

6 For each level the p-factor for Eq. (16) is computed using Eq. (18) , using appropriately approximated boundaries for the spectrum of the Ritz 
matrix of the previous level. Note that the factor p converges to 1 as more levels are introduced, and the preconditioning step converges to the RHi. 
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Figure 12: Accuracy £ of the nested method with 1, 2, 3, 4 and 5 nesting levels for lattice size 8 in the Hermitian case with deflation gap A = 0.055 
and Icq = 1536. We plot the dependence of £ on the size kj, i = 1, . . . , 5, of the innermost Krylov space. The convergence curves are labelled with 
the number of levels in the method. For the /-level method, the outer levels kj, j = 1, .,.,( — 1 are fixed to a value in their convergence region. 



be related to the quadratically convergent RHi, as the preconditioning step at each level mimics a step of the RHi 
and compresses the spectrum more and more towards ±1. Moreover, the judicious choice of p at each level improves 
the convergence even more. It is intriguing to note that, in the example of Fig. 12 the sign of a matrix of dimension 
n = 49152 can be evaluated to an accuracy of 10~ 9 by computing the sign of a 2 x 2 matrix, which is then lifted back 
to the original n-dimensional space through linear combinations of Krylov vectors. This emphasizes again the power 
of Krylov subspace methods. 



5. Conclusions 

In this paper we have presented a nested Krylov subspace method which boosts the Krylov-Ritz approximations 
used to compute the sign function of both Hermitian and non-Hermitian matrices. The Krylov-Ritz approximation 
projects the matrix on a Krylov subspace in which it computes the sign function exactly, before lifting it back to 
the original space. Its standard implementation suffers from the CPU intensive computation of the sign of the Ritz 
matrix, which goes like the cube of the Krylov subspace size. By making an additional projection on a much smaller 
Krylov subspace, the nested method significantly reduces the total computation time of the Krylov-Ritz approxima- 
tion, without affecting its accuracy. Numerical tests showed that the nested method works equally well for Hermitian 
and non-Hermitian matrices and is more efficient than state-of-the-art rational approximation methods. Moreover, it 
exhibits a good, close to linear, volume scaling. We are currently investigating the efficiency of the nested method for 
larger lattice volumes using a parallel implementation of the algorithm. 

To end, we comment on the relation between the nested method and the extended Krylov subspace methods 
introduced in Ref. [20 1. An extended Krylov space is defined as 

K k (A,A-\x) = spm(x,Ax,A- l x,A 2 x,A- 2 x,...,A k - l x,A- k+l x), (24) 

and an approximation in that subspace approximates f(A) by the sum Q(A) — Y?-k+\ c /^'- ^ n tne nested method we 
construct the ^-dimensional Krylov subspace 'Kf(H' k , e\ ), which forms an ^-dimensional subspace of the (2{ - 1)- 
dimensional extended Krylov subspace r K({pHk,(pHkY l ,e\^). The nested method implicitly fixes the coefficients 
of the positive and negative powers of QiH^) to be equal, c_, = c,, which follows from the use of the property 
sgn(H),) = sgniHk + H7 1 ). Hence, the nested method implicitly truncates the size of the extended Krylov subspace. 

Approximations for the sign function in extended Krylov subspaces have been briefly considered recently lETl . 
however not in combination with the nesting of Krylov subspaces, i.e. the extended subspace is constructed for the 
original matrix A, not for the Ritz matrix H^. Evidently this is not feasible in the application to lattice QCD as the 
inversion of the 75-Wilson Dirac operator is too expensive in order to construct extended Krylov subspaces. 
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To conclude, we briefly consider the application of the nested method to other matrix functions. The method 
presented in Sec. [4]requires a transformation which leaves the matrix function invariant, similar to Eq. ( fT6) i for the 
sign function. If such a transformation is not known, the nested method could be adapted by using an extended Krylov 
subspace method at the inner level. This is also a topic of work in progress. 
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Appendix A. Lanczos algorithm 

Vi <— ri 
r <— Av[ 

for j ; = 1 to k do 
HU,J)*-v)r 
r <- r-H(J,j)Vj 
it j - k then 

stop 
end if 
/3<- VrV 
HUJ+ 1)«-JS 

r <- Avj+i 
r <— r - /3vy 
end for 

All H(i, j) not assigned above are zero. Consequently H is tridiagonal and symmetric. The v> are the column vectors 
of the matrix Vk. 

Appendix B. Two-sided Lanczos algorithm 

vi «- n 

W\ <— vi 

r <— Av'! 

/ <- A f wi 

for j f = 1 to k do 

#(/,/) «- wjr 
r <- r-H(j,f)Vj 

i^i-(H(jj)y W j 

it j - k then 

stop 
end if 

5 <- r 1 "/ 

if 5 = then 

serious breakdown, stop 
end if 

HU+l,j)*-0 
H(j,j+\)<-y 
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r <- Avj+i 
I <— A' wj+i 
r <— r - yVj 
Z <^l-jTWj 
end for 

The Vj and Vfj are the column vectors of the matrices V* and Wj, respectively. All H(i, j) not assigned above are zero. 
Consequently H is tridiagonal, but not symmetric as in the Hermitian case. The coefficients /3 and y are, non-uniquely, 
chosen to satisfy the biorthonormality condition 

w]vi = 6 u . (B.l) 

There are potential problems in the two-sided Lanczos process, namely serious breakdowns and near breakdowns, 
where 5 <— r*l = 0, respectively « 0, however, these were not encountered in our numerical tests. 



Appendix C. Nested algorithm 

Given a (non-)Hermitian matrix A, a source vector x and the critical eigenvectors r, (left and right eigenvectors 
and r,), with eigenvalues A,-, i = I,... ,m, do: 

1. Apply Left-Right deflation (see Ref. lPTTl ) to construct x e , where the components of the source vector x along 
the eigenvectors r, have been removed: 



Xe = x-^\ji,x)ri, 



where /, = r, for Hermitian A. 
2. Run the (two-sided) Lanczos algorithm from Appendix A (Appendix B i with A and x e to obtain V* and H^. 



3. Perform an LU decomposition of pHk, e.g., with the LAPACK routine dgttrf (zgttrf). This yields a lower 
triangular matrix L with unit diagonal and one sub-diagonal, and an upper triangular matrix U with one diagonal 
and two super-diagonals. All other entries of L and U are zero. 

4. Run the (two-sided) Lanczos algorithm with H' k = (pH^ + (pH^y l )l2 and source vector ef- 1 to construct the 
Rrylov basis V( and the Ritz matrix H{ . To do so apply H' k to each Krylov vector v: 

(a) Compute (pHkY l v using a sparse LU back substitution, e.g., with the LAPACK routine dgttrs (zgttrs). 

(b) Compute (pH^v and add to the result of (a). This tridiagonal multiply and add can be done efficiently 
using the BLAS band-matrix-vector multiplication routine dsbmv (zgbmv). 

5. Run the RHi (or any other suitable method to compute the sign function) on H( to obtain sgn(H(). 

6. The final approximation is then given by 

m 

sgn(A)x * ^ sgn(Ai)(li,x)n + \x e \V k V e sga(H e )ef . 
Note that steps (3-5) are done in real arithmetic in the Hermitian case. 
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